ABSTRACT

The time people spend in front of computers has been increasing
steadily due to the role computers play in modern society.
Individuals who sit in front of computers for extended periods of
time, especially with improper posture, may incur various health
issues. In this work, individuals' behaviors in front of computers are
studied using web cameras. By means of a non-rigid face tracking
system, data are analyzed to determine the 3D head pose, blink rate
and yawn frequency of computer users. By combining these visual cues,
a system of intelligent personal assistants for computer users is
proposed.

Index Terms — face tracking, recommendation, ergonomics 

1. INTRODUCTION 

There has been increased interest in monitoring the health status of
computer users during the last decade, especially as laptops with
embedded web cameras become more and more popular. Every day we can
see employees, students, and IT workers spend most of their time
sitting stiffly in front of computers and staring at the screen
continuously. Under such circumstances, very few of them take planned
periodic mini-breaks. Consequently, a large number of computer users
suffer from musculoskeletal Repetitive Stress Injuries (RSIs) [1]. To
address this problem, an effective mechanism should be built to remind
computer users in advance. People generally feel tired after a long
period of intensive work, and this tiredness acts as a natural
warning. Therefore, fatigue can serve as a good trigger for the
reminder.

This work focuses on fatigue detection and recommendation making for
computer users by means of a single web camera. To achieve this aim, a
non-rigid face tracking method [2] is first applied to the real-time
camera video, which yields the positions of the eye, mouth and head
areas. Secondly, blink detection, yawn detection and 3D head pose
analysis are performed on each frame to obtain three fatigue features.
Finally, a fuzzy logic system is used to fuse the three features and
give health recommendations to computer users.

There are several ongoing research efforts on assistive robots [3]
that help the elderly and people with special needs. Instead of
building a robot, the computer itself can serve as the robot for
posture correction and improvement for people who suffer from
repetitive stress injuries, since the injuries are rooted in computer
use. The goal of our system is to reduce health risks by providing
suggestions that improve users' posture and productivity in the short
and long term. An additional advantage is the strong computation power
of computers compared with the embedded systems in robots. Our work
has the potential to turn the computer into a personalized caretaker
for its user.


Our contributions are: 1) We propose a three-layer system to protect
the user from potential health issues. 2) In the tracking layer, we
improve the non-rigid face tracking algorithm by removing jitter on
landmarks and reinitializing the tracking automatically. 3) In the
feature layer, we propose a self-adaptive blink detection method. 4)
In the recommendation layer, our inference framework uses fuzzy logic
to combine expert rules with users' feedback and give users dynamic
and personalized recommendations.

The paper is organized as follows: Section 2 outlines recent work on
vision-based face tracking and assistive systems. Our system and
algorithms are presented in Sections 3, 4 and 5. Experimental
evaluations are shown in Section 6. Finally, we conclude the paper and
discuss future work in Section 7.

2. RELATED WORK 

The proposed system may serve as an intelligent assistant for computer
users. This is realized by detecting the users' fatigue status using a
web camera and making health recommendations for them. The related
work covers the estimation of 3D head pose, blink rate and yawn
frequency.

3D head pose estimation. Gaze direction is closely related to a
user's attention [4]. Therefore, accurate estimation of 3D head pose
helps the system judge whether a user is in a correct position when
using the computer and give recommendations accordingly. However,
there are few generic solutions for identity-invariant head pose
estimation. The inherent difficulties and the evolution of this field
were surveyed by Murphy [5] in 2009.

Blink rate estimation. In order to estimate the blink rate, blinks
must first be identified. Global template matching [6] is a common
method for this. First, the eye regions are detected. After the open
state of the eyes is learned, a global template is formed and compared
with the current eye appearance, from which a blink can be identified.
In 2002, Morris et al. [7] proposed a real-time blink detection
system, in which variance maps were used to find eye-feature points.
In 2005, Chau et al. [8] put forward a blink detection system using
the correlation with an online template. Yang et al. [9] modeled the
shape of the eyes using a pair of parameterized parabolic curves, and
then fit the model globally to find potential eye regions. In addition
to template matching, other blink detection methods include the
statistical algorithm by Pan et al. [10], the optical flow based
algorithm by Divjak et al. [11], and the facial feature based
algorithm by Moriyama et al. [12].

Yawn frequency estimation. Yawning serves as an important factor in
judging fatigue when combined with the 3D head pose and blink rate.
Mohanty et al. [13] proposed a non-rigid estimation algorithm for yawn
detection, in which the degree of lip shape deformation was
quantified. Du et al. [14] proposed a yawn detection algorithm based
on kernelized fuzzy rough sets. In Omidyeganeh's work, a yawn was
detected based on the aspect ratio of the extracted mouth area as
compared with an experimentally tuned threshold [15]. Abtahi's
algorithm [16] focused on the calculation of changes in the geometric
features of the mouth.


3. TRACKING LAYER 

3.1. Hardware and Software Fundamentals

Fig. 1. Overview of the non-rigid face tracking system.


Web cameras are widely available on personal computers. A web camera
is used in our system to track facial expression features, and a
non-rigid tracking algorithm is proposed based on the Active
Appearance Model (AAM) [17, 18]. An overview of our tracking system is
shown in Fig. 1. Seventy-six landmarks are annotated manually on each
image, from which a linear shape model, a correlation patch model and
a face detection model are trained. By combining these three models,
the tracking model is obtained and applied to track faces.



Fig. 2. Visualization of 76 tracking landmarks. 

In order to improve the generalization performance and tracking
accuracy, 385 face images covering different age, expression and
ethnicity groups are used for training. Fifty-eight of them were
captured in our lab, and the remaining 327 images were selected from
the Biwi 3D Audiovisual Corpus of Affective Communication database
[19] from the Swiss Federal Institute of Technology Zurich (ETHZ). As
shown in Fig. 2, 76 facial landmarks are located for tracking, of
which 15 lie on the chin, 6 on each eyebrow, 9 on each eye, 12 on the
nose and 19 on the mouth.


3.2. Shape and Patch Models 

As indicated in Figure 1, the shape model uses a linear representation
of the facial geometry to describe how landmarks vary across different
people and expressions. The goal of linear modeling is to find a
low-dimensional subspace within the $2N$-dimensional space of landmark
coordinates ($N$ is the total number of landmarks). Principal
Component Analysis (PCA) is applied to find the best subspace. As for
the correlation-based patch models, we learn from the annotated
dataset a group of image patches that produce strong responses at the
exact locations of the landmarks. Cross-correlation between the patch
models and the image is calculated to estimate the feature locations
and correct the facial shape model.
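
As a hedged illustration of the correlation step, the following Python
sketch slides a single fixed patch over a tiny grayscale image and
returns the location with the highest normalized cross-correlation.
The function names `ncc` and `best_match`, the toy image, and the use
of a fixed patch with normalized (rather than raw) cross-correlation
are assumptions made for illustration; they stand in for the trained
patch models described above and are not those models.

```python
import math


def ncc(patch, window):
    """Normalized cross-correlation between two equally sized 2D lists."""
    a = [v for row in patch for v in row]
    b = [v for row in window for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    # A flat patch or window has no texture to correlate with.
    return num / den if den > 0 else 0.0


def best_match(image, patch):
    """Slide the patch over the image; return (row, col, response) of the peak."""
    ph, pw = len(patch), len(patch[0])
    best = (0, 0, -2.0)  # any real response lies in [-1, 1]
    for r in range(len(image) - ph + 1):
        for c in range(len(image[0]) - pw + 1):
            window = [row[c:c + pw] for row in image[r:r + ph]]
            score = ncc(patch, window)
            if score > best[2]:
                best = (r, c, score)
    return best


# Toy data: the patch appears exactly once, with its top-left corner at (1, 1).
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 3, 9, 0, 0],
    [0, 0, 0, 0, 0],
]
patch = [[1, 2], [3, 9]]
row, col, response = best_match(image, patch)
```

The peak response is what a tracker of this kind would use to pull a
landmark toward its most likely location before the shape model
constrains the overall geometry.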

The tracking procedure suffers from landmark jittering, which makes it
difficult to accurately detect facial movements such as blinks and
yawns. Occlusion is another problem in tracking: once part of the face
is occluded, the tracking cannot be reinitialized automatically.

3.3. Jittering Removal 

Jittering of landmarks is very common during tracking. Its pattern can
be learned by means of Maximum Likelihood Estimation (MLE) and
Bayesian Minimum Error Estimation (BMEE). In order to eliminate the
ambiguity between facial movement and jittering, the face should keep
still for about 0.5 seconds during the learning period in our system.

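Let $\omega_i$ be the event that the $i$th landmark jitters, and let
$d_i$ represent the offset of the $i$th landmark between consecutive
frames ($i = 1, 2, \ldots, N$), where $N$ is the total number of
landmarks. Supposing that landmark jittering follows a Gaussian
distribution, that is $p(d_i|\omega_i) \sim N(\mu_i, \sigma_i^2)$, we
have

$$\hat{\mu}_i = \frac{1}{M}\sum_{k=1}^{M} d_i^{(k)}, \qquad \hat{\sigma}_i^2 = \frac{1}{M}\sum_{k=1}^{M} \left(d_i^{(k)} - \hat{\mu}_i\right)^2, \tag{1}$$

where $M$ stands for the total number of training frames and
$d_i^{(k)}$ is the offset observed in the $k$th training frame.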

According to Bayesian Minimum Error Estimation, instead of comparing
the posterior probabilities $p(\omega_i|d)$ and
$p(\bar{\omega}_i|d)$, where $\bar{\omega}_i$ denotes the event that
the $i$th landmark does not jitter, we can compare
$p(d|\omega_i)p(\omega_i)$ and $p(d|\bar{\omega}_i)p(\bar{\omega}_i)$,
which is equivalent to comparing $p(d|\omega_i)$ with
$p(d|\bar{\omega}_i)p(\bar{\omega}_i)/p(\omega_i)$. Given that
$p(\omega_i)$ and $p(\bar{\omega}_i)$ are relatively fixed, and that
it is difficult to obtain $p(d|\bar{\omega}_i)$, we replace
$p(d|\bar{\omega}_i)p(\bar{\omega}_i)/p(\omega_i)$ with a fixed
threshold $\alpha$. Therefore, the classifier is defined by the
following rules:

R1: IF $p(d|\omega_i) < \alpha$, THEN the offset is caused by facial
movement.

R2: IF $p(d|\omega_i) > \alpha$, THEN the $i$th tracking point
jitters.

Let $\phi(x_1) = \alpha$, $\phi(x_2) = \alpha$, $x_1 < x_2$, where
$\phi(x)$ and $\Phi(x)$ stand for the standard normal probability
density function and cumulative distribution function respectively.
We have the error rate $P_e = \Phi(x_2) - \Phi(x_1)$. The performance
of the jittering removal is evaluated in Section 6.
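
The learning and classification steps can be sketched in Python as
follows. This is a minimal sketch under stated assumptions: offsets
are treated as scalars for a single landmark, the Gaussian parameters
are the maximum likelihood estimates from frames recorded while the
face is still, and an offset is labelled as jitter when its Gaussian
likelihood exceeds the threshold. The function names, the sample
offsets and the value of `alpha` are hypothetical.

```python
import math


def fit_jitter_model(offsets):
    """Maximum likelihood mean and variance of offsets from still frames."""
    m = len(offsets)
    mu = sum(offsets) / m
    var = sum((d - mu) ** 2 for d in offsets) / m
    return mu, var


def gaussian_pdf(d, mu, var):
    """Gaussian likelihood of offset d (var must be positive)."""
    return math.exp(-(d - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def is_jitter(d, mu, var, alpha):
    """True when the likelihood exceeds alpha (jitter), False for real movement."""
    return gaussian_pdf(d, mu, var) > alpha


# Made-up offsets (in pixels) of one landmark while the face keeps still.
still_offsets = [0.4, -0.3, 0.1, -0.2, 0.0, 0.3, -0.1, -0.2]
mu, var = fit_jitter_model(still_offsets)

alpha = 0.05  # hypothetical threshold
small = is_jitter(0.2, mu, var, alpha)   # small offset: likely jitter
large = is_jitter(5.0, mu, var, alpha)   # large offset: likely facial movement
```

Offsets classified as jitter can then be suppressed (for example by
holding the previous landmark position), while offsets classified as
movement are passed on to the feature layer.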

3.4. Automatic Reinitialization of Face Tracking

Figure 3(a) shows that once the tracker fails, tracking cannot be
reinitialized automatically, which causes landmarks to drift. To solve
this problem, a Support Vector Machine (SVM) is used to decide when
reinitialization is needed. Two factors reflect the tracking
accuracy: one is the position of the 76 landmarks, and the other is
the response value obtained from template matching. Consequently, the
response value and the relative (x, y) coordinates of each tracking
point together compose a 228-dimensional feature vector, in which the
relative coordinates are the offsets of each tracking point from the
mass center of all tracking points. As shown in Fig. 3(b), when the
tracker loses the face, cascade face detection [20] is restarted until
the face is tracked successfully again. Moreover, model training and
automatic reinitialization can be performed simultaneously in real
time. Its performance is discussed in the evaluation section.

Fig. 3. Comparison of the tracking effect with and without
automatic reinitialization.
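
To make the 228-dimensional feature vector concrete, the following
Python sketch assembles it from 76 landmark positions and their
template-matching responses (76 x 3 = 228). The ordering of the
entries (response, then relative x, then relative y) and the sign
convention (landmark minus mass center) are assumptions, as is the
function name; the SVM itself is omitted.

```python
def reinit_feature_vector(landmarks, responses):
    """Build the tracking-quality feature vector.

    landmarks: list of (x, y) positions, one per tracking point.
    responses: template-matching response value per tracking point.
    """
    n = len(landmarks)
    # Mass center of all tracking points.
    cx = sum(x for x, _ in landmarks) / n
    cy = sum(y for _, y in landmarks) / n
    features = []
    for (x, y), r in zip(landmarks, responses):
        # Assumed layout: response, then coordinates relative to the center.
        features.extend([r, x - cx, y - cy])
    return features


# Synthetic example with 76 landmarks and a constant response value.
landmarks = [(100.0 + i, 50.0 + 2 * i) for i in range(76)]
responses = [0.9] * 76
vec = reinit_feature_vector(landmarks, responses)
```

Using coordinates relative to the mass center makes the vector
independent of where the face sits in the image, so the classifier
only sees the landmark configuration and the matching quality.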


4. FEATURE LAYER 
4.1. 3D Head Pose Classification 

The pinhole camera model is applied in our system. Ten facial
landmarks (No. 38-44, 46, 47, 67) are selected to estimate the 3D head
pose, because their movement is mostly rigid. Taking the 67th landmark
as the origin in 3D space, the relative coordinates of the other
landmarks are obtained. Then the algorithm in [21] is used to estimate
the camera rotation and translation parameters. The correspondence
between 3D space and the image plane is defined by

$$s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = A \, [R|t] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \tag{2}$$

where $s$ is a scale coefficient, $A$ is the camera intrinsic
parameter matrix, which can be computed through calibration, and
$[R|t]$ is the camera extrinsic parameter matrix. $R$ is the rotation
matrix and $t$ is the translation vector. $(x, y)$ represents the
coordinates in the image plane, and $(X, Y, Z)$ represents the
coordinates in 3D space. The 3D head pose is indicated by the
extrinsic parameter matrix.
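
The projection defined by the pinhole model can be illustrated with a
short Python sketch that maps a 3D point to image coordinates given
the intrinsic matrix A, a rotation R and a translation t. The
intrinsic values, the helper `rotation_y` and the test points are
assumptions chosen for illustration, and the scale s is taken as the
third homogeneous component, which is the usual convention and is
assumed here.

```python
import math


def project(point3d, A, R, t):
    """Project (X, Y, Z) to (x, y) using s*[x, y, 1]^T = A [R|t] [X, Y, Z, 1]^T."""
    # Coordinates in the camera frame: R * X + t.
    cam = [sum(R[i][k] * point3d[k] for k in range(3)) + t[i] for i in range(3)]
    # Homogeneous image coordinates: A * cam.
    hom = [sum(A[i][k] * cam[k] for k in range(3)) for i in range(3)]
    s = hom[2]
    return hom[0] / s, hom[1] / s


def rotation_y(angle):
    """Rotation matrix for a turn of `angle` radians about the y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Assumed intrinsics: focal length 600 px, principal point at the image center.
A = [[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 500.0]  # head 500 units in front of the camera

origin_px = project((0.0, 0.0, 0.0), A, rotation_y(0.0), t)
point_px = project((50.0, 0.0, 0.0), A, rotation_y(0.0), t)
turned_px = project((50.0, 0.0, 0.0), A, rotation_y(math.pi / 2), t)
```

Pose estimation solves the inverse problem: given the 3D landmark
layout and the observed image positions, it searches for the R and t
that make the projected points match the tracked landmarks.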

In our experiments, the intrinsic parameters of most web cameras were
found to be approximately the same. Therefore, the camera intrinsic
parameters are set to fixed values in our system instead of being
calibrated repeatedly.

An SVM classifier is applied to determine whether the user is in a
correct pose while using the computer. The following classes are
defined:

Pose1: The user is not looking at or not in front of the computer.

Pose2: The user is in a correct pose.

Pose3: The user is too close to the screen.

Pose4: The user's head is tilted to the left.

Pose5: The user's head is tilted to the right.


The 6-dimensional feature vector contains the rotation and translation
parameters. The classification results obtained in this layer are used
in the following section to make recommendations for the users.



Fig. 4. 3D head pose classification. The cyan, yellow and blue lines
represent the x, y and z axes respectively. The pose numbers 1.00-5.00
correspond to the classes Pose1-Pose5 respectively.


4.2. Self-Adaptive Blink Detection 

A self-adaptive algorithm is designed to better detect blinks under
different conditions, as shown in Algorithm 1. The idea comes from the
intuition that the eye-closure state occupies only a small proportion
of the working time, and that the color of the eyeball patch differs
markedly between the closed and open eye states. Therefore, the
average color $C$ of the eyeball patch can be obtained and its changes
monitored in real time. In addition, $C$ is normalized as
$\hat{C} = (C - EC)/\sqrt{VarC}$ to minimize environmental
disturbances during tracking, where $EC$ and $VarC$ represent the
expectation and variance of $C$. In this way, a fixed threshold $C_t$
can be applied to predict whether the eyes are open or closed. If the
system finds that the user's eye state has switched from open to
closed, a blink is detected.


Algorithm 1  $\delta$ = Eye($I$, $P_l$, $P_r$, $d$, $w_i$, $h_i$, $C_t$, $r_f$)

Input: $I$ represents the image captured from the web camera. $I_R$,
$I_G$ and $I_B$ represent the three color channels of the image. $P_l$
and $P_r$ represent the (x, y) coordinates of the left and right
eyeballs respectively, where $P_l.x$ and $P_l.y$ are the x and y
coordinates respectively. $d$ denotes the distance between the face
and the web camera. $w_i$ and $h_i$ are the indices used to set the
width and height of the patch around the eyeball. $C_t$ is a fixed
threshold for judging the eye state. $r_f$ is the frame rate of the
web camera.

Output: $\delta$ is the binary classification result, where 1
represents the eye open state, and 0 represents the eye closure state.

$EC = 0$, $VarC = 0$
total frame count: $f_c = 0$
patch size: width $w_p = w_i/d$, height $h_p = h_i/d$
average color: $C$ = mean of $I_R$, $I_G$ and $I_B$ over the $w_p \times h_p$ patches around $P_l$ and $P_r$
$f_c \leftarrow f_c + 1$, $EC = \frac{EC \times r_f + C}{r_f + 1}$, $VarC = \frac{VarC \times r_f + (C - EC)^2}{r_f + 1}$
normalization: $\hat{C} = \frac{C - EC}{\sqrt{VarC}}$
if $f_c > r_f$ then
    if $\hat{C} < C_t$ then $\delta = 1$, eye is open
    else $\delta = 0$, eye is closed
    end if
else Initialization process with no outcome
end if
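
The self-adaptive scheme can be sketched in Python as below. This is
an illustrative sketch rather than the exact procedure: the running
expectation and variance are kept as moving averages weighted by the
frame rate, which is an assumption about the update step, and the
per-frame input is taken to be the already computed average
eye-patch color. The class name `BlinkDetector` and the synthetic
color sequence are hypothetical.

```python
import math


class BlinkDetector:
    def __init__(self, threshold, frame_rate):
        self.threshold = threshold    # fixed threshold on the normalized color
        self.frame_rate = frame_rate  # frames per second, also the warm-up length
        self.mean = 0.0               # running expectation of the patch color
        self.var = 0.0                # running variance of the patch color
        self.frames = 0               # total frame count
        self.prev_open = None
        self.blinks = 0

    def update(self, color):
        """Feed one frame's average patch color; return 1 (open), 0 (closed) or None."""
        r = self.frame_rate
        self.frames += 1
        # Assumed moving-average updates with a window of about one second.
        self.mean = (self.mean * r + color) / (r + 1)
        self.var = (self.var * r + (color - self.mean) ** 2) / (r + 1)
        if self.frames <= r or self.var == 0.0:
            return None  # initialization period, no outcome yet
        normalized = (color - self.mean) / math.sqrt(self.var)
        is_open = normalized < self.threshold
        if self.prev_open and not is_open:
            self.blinks += 1  # open -> closed transition counts as one blink
        self.prev_open = is_open
        return 1 if is_open else 0


# Synthetic sequence: open eyes (color near 50), two closed frames (120), open again.
detector = BlinkDetector(threshold=2.5, frame_rate=30)
open_frames = [49.0 if i % 2 == 0 else 51.0 for i in range(300)]
sequence = open_frames + [120.0, 120.0] + open_frames[:20]
states = [detector.update(c) for c in sequence]
```

Because the statistics keep adapting, a long eye closure gradually
stops looking unusual, which matches the assumption that closed eyes
occupy only a small share of the working time.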

4.3. Yawn Detection 

An SVM classifier is designed to determine whether the mouth is open
or closed. The SVM feature vector consists of the (x, y) coordinates
of the mouth landmarks (No. 48-66). If the mouth stays open for longer
than a preset time threshold $t_t$, a yawn is detected.
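
The duration rule can be sketched in Python as follows, assuming the
per-frame open or closed decision is already available (in our system
it would come from the SVM, which is not shown). The function name
`count_yawns` and the synthetic frame sequence are hypothetical.

```python
def count_yawns(mouth_open, frame_rate, min_duration):
    """Count runs of open-mouth frames lasting at least min_duration seconds."""
    min_frames = int(round(min_duration * frame_rate))
    yawns, run = 0, 0
    for is_open in mouth_open:
        if is_open:
            run += 1
            if run == min_frames:
                yawns += 1  # count each run once, when it reaches the threshold
        else:
            run = 0
    return yawns


# 50 open frames at 30 fps (about 1.7 s) count as a yawn; 20 frames do not.
frames = [False] * 10 + [True] * 50 + [False] * 10 + [True] * 20 + [False] * 5
n = count_yawns(frames, frame_rate=30, min_duration=1.5)
```

The time threshold separates yawns from speech and other brief mouth
openings, which rarely keep the mouth open for that long.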



Fig. 5. Prediction of mouth condition. 1.00 represents a closed mouth
and 2.00 represents an open mouth.


5. RECOMMENDATION LAYER 
5.1. Recommendation Framework 

After the features are gathered from the feature layer, health
recommendations can be made for the users in front of web cameras.
First, several rules are defined, which may be obtained from
ergonomics experts or doctors. A few examples are given as follows:

$R_1$: IF the user works for more than 30 minutes, THEN take a break.

$R_2$: IF the user stays in a bad pose for more than 10 minutes, THEN
raise the alarm.

$R_3$: IF the user yawns more than 5 times in a 10-minute period, THEN
take a break.

Each rule consists of a set of premises and a consequence, which is a
recommendation generated by the system. The fuzzy logic system is used
to formulate the recommendation logic in the following way:

$$f(x|\theta) = \frac{\sum_{i=1}^{R} b_i \prod_{j=1}^{n} \theta_{ij}}{\sum_{i=1}^{R} \prod_{j=1}^{n} \theta_{ij}}, \tag{3}$$

where $R$ is the number of rules and $n$ is the number of inputs in
each rule. We assume that each premise has a score, which is given by
the confidence from the feature layer. $f(x|\theta)$ is the predicted
output given an input data tuple $x$. $\theta_{ij}$ is the confidence
score associated with the $j$th premise in rule $R_i$, and $b_i$ is
the weight associated with rule $R_i$. All the weights need to be
learned before the recommendation system works. $f(x|\theta)$
represents the recommendation. For instance, $f(x|\theta) = 1$ means
taking a break and $f(x|\theta) = -1$ means continuing to work. The
weights can be obtained by means of batch least squares estimation.

For a rule with $n$ premises, we define:

$$\mu_i(x) = \prod_{j=1}^{n} \theta_{ij}, \tag{4}$$

$$\xi_i(x) = \frac{\mu_i(x)}{\sum_{k=1}^{R} \mu_k(x)}. \tag{5}$$

Therefore, $f(x|\theta) = b^T \xi(x)$, where
$\xi(x) = (\xi_1(x), \ldots, \xi_R(x))^T$ and
$b = (b_1, \ldots, b_R)^T$. The unknown $b$ can be easily resolved
using least squares estimation. In order to train the recommendation
system, $R$ rules and at least $R$ confidence scores and corresponding
actions are needed.
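
The inference and training steps can be sketched in Python under the
following assumptions: each rule's firing strength is the product of
its premise confidence scores, the firing strengths are normalized to
sum to one, the output is their weighted sum, and the weights are
fitted by solving the normal equations. All function names and the
sample confidence scores are hypothetical, and the small
Gaussian-elimination solver is included only to keep the sketch
self-contained.

```python
def firing_strength(confidences):
    """Product of the premise confidence scores of one rule."""
    mu = 1.0
    for c in confidences:
        mu *= c
    return mu


def basis(theta):
    """Normalized firing strengths; theta[i] lists the premise scores of rule i."""
    mus = [firing_strength(row) for row in theta]
    total = sum(mus)
    return [m / total for m in mus]


def predict(b, theta):
    """Recommendation value as the weighted sum of normalized firing strengths."""
    return sum(bi * xi for bi, xi in zip(b, basis(theta)))


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_weights(samples, targets):
    """Batch least squares: solve (Xi^T Xi) b = Xi^T y for the rule weights b."""
    xi = [basis(theta) for theta in samples]
    n = len(xi[0])
    gram = [[sum(row[i] * row[j] for row in xi) for j in range(n)] for i in range(n)]
    moment = [sum(row[i] * y for row, y in zip(xi, targets)) for i in range(n)]
    return solve(gram, moment)


# Two rules with two premises each; targets are generated from known weights.
b_true = [1.0, -1.0]
samples = [
    [[0.9, 0.8], [0.1, 0.2]],
    [[0.1, 0.1], [0.9, 0.9]],
    [[0.5, 0.5], [0.5, 0.5]],
]
targets = [predict(b_true, theta) for theta in samples]
b_fit = fit_weights(samples, targets)
```

With noise-free targets the fitted weights recover the generating
weights; with real feedback data the same procedure returns the
weights that minimize the squared error over all samples.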


5.2. Dynamic Adaptation 

Users may provide explicit or implicit feedback to the system.
Explicit feedback is gathered from the users' actions, such as
clicking a dislike button. Implicit feedback is learned from visual
cues, such as the user continuing to work after the system suggests
taking a break. In our system, only explicit feedback is considered.
Vector $b$ is updated by means of Equation (6):

$$b^{(t)} = (1 - \alpha) b^{(t-1)} + \alpha b^{*}, \tag{6}$$

where $b^{*}$ is calculated by solving the normal equation using the
users' feedback data, and $\alpha$ is the adaptation rate, so that the
system gradually learns to adapt to the users' preferences.
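
The update rule is a simple blend of the previous weights and the
weights fitted to the feedback data, as the following Python sketch
shows. The function name, the starting weights, the feedback-fitted
weights and the adaptation rate of 0.5 are illustrative assumptions.

```python
def adapt_weights(b_prev, b_star, alpha):
    """Element-wise update: (1 - alpha) * previous weights + alpha * feedback weights."""
    return [(1.0 - alpha) * p + alpha * s for p, s in zip(b_prev, b_star)]


b = [1.0, -1.0, 1.0]        # weights learned from the expert rules
b_star = [0.0, -1.0, 2.0]   # hypothetical weights fitted to user feedback
for _ in range(3):
    b = adapt_weights(b, b_star, alpha=0.5)
```

Each update moves the weights a fraction of the way toward the
feedback-fitted values, so a small adaptation rate gives slow, stable
personalization and a large one follows recent feedback closely.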

6. EVALUATIONS 

Experiments are performed on 5 randomly picked volunteers, including 4
males and 1 female. They were asked to do regular computer work in
front of their PCs for at least 10 minutes under different conditions.
A camera resolution of 640 x 480 is tested, and the volunteers' faces
may be occluded by glasses or hands. Since many people work with
computers in dim environments, three of the tests are performed under
poorly illuminated conditions, which may blur facial features such as
the eyes. The accuracy of blink detection and yawn detection is
evaluated quantitatively, and all the data are integrated to provide
suggestions in our system.

As for the blink and yawn detection, we manually calculate the hit
rate and false detection rate of the system on all volunteers with
the blink threshold $C_t = 2.5$ and the mouth open time threshold
$t_t = 1.5$ seconds for best performance. Ninety-five blinks and
twenty-five yawns are taken into consideration, with about 20 blinks
and 5 yawns from each volunteer. The hit rates are 90.5% and 100% for
blinks and yawns respectively, and the false detection rates are
12.6% and 8.3% respectively.

To provide suggestions, the working time is divided into 10-minute
periods, and the user's status in each period is determined based on
the blink count, yawn count and 3D head pose. Figure 6 shows the
outcome of our system when tracking a volunteer working for more than
6 hours in front of the computer. Comparing the outcome with the
actual events shows that the user's absence from the computer during
10:18-10:28, 11:31-11:42 and 13:06-13:27 is due to mini-breaks. The
user's absence during 12:13-12:55, 15:02-15:12 and 15:23-15:33 is due
to lunch, paper work and group discussion respectively. Moreover, the
right subfigure shows that the user stays in a bad pose most of the
time and works continuously for more than 30 minutes, which triggers
the system's alarm. It also shows that the user is in a potential
fatigue condition during 14:30-15:00 due to the increase in yawn
rate. This is because the user usually takes a nap at noon but did
not do so that day because of the experiment, which made him feel
tired in the afternoon.


7. CONCLUSION 

In this work, a system is presented which combines non-rigid face
tracking with feature analysis to determine the working status of
computer users. Jittering removal and automatic reinitialization
methods are designed to improve the performance of traditional face
tracking algorithms, and statistical learning methods are applied in
the feature analysis. By using blink detection, yawn detection and 3D
head pose analysis, the working status of computer users can be
predicted and recommendations can be made. Future work will focus on
developing a user-based model to generalize the performance of the
system, which will improve the accuracy of tracking and feature
detection.

Fig. 6. Outcome of a volunteer working for more than 6 hours in front
of the computer. The left subfigure shows the proportion of each head
pose during every ten-minute period. The right subfigure shows the
prediction of the user's status and the blink and yawn counts during
each period.